Video 2.1: Data wrangling

WRANGLE that DATA

Cartoon of a small fuzzy monster with a cowboy hat and lasso, riding another larger fuzzy monster and lassoing a group of angry / unruly looking creatures labeled data.

Artwork by @allison_horst


Monsters are back

Our first order

To: Bot Bakery
Subject: Big ship treat order
Attached: crew_food_allergies.XLSX

Hi friend,

Here’s that table of my crew’s goody preferences. 😋

Guess which one is me!

Get the data

We can download the attached Excel file from: here

After you’ve downloaded the file, move the file from your Downloads/ folder to a data/ folder in your project’s main directory:

  1. Download the Excel file
  2. Create a “data” folder in your project directory
  • You can do this the usual way with Windows Explorer, or use the built-in New Folder button in the RStudio Files pane.
  1. Move the Excel file from your Downloads/ folder to the new data/ folder.

Excel in R

Have an EXCEL file? The readxl package has the function read_excel() for reading Excel data into R.

Let’s install and load readxl:

install.packages(readxl)

library(readxl)

Now we can read in our Excel file:

crew_df <- read_excel("data/crew_food_prefs.XLSX")

Ope! What happened to our column names / headers?


crew_df <- read_excel("data/crew_food_prefs.XLSX", skip = 1)


An Excellent alternative

Glimpse the data

glimpse(crew_df)

summary(crew_df)

filter() out the test data

There appears to be test data included in the table. Yuk! Let’s filter it out so it doesn’t influence our results.

crew_df <- filter(crew_df, id != "test")

select() what we need

There’s also some columns in here we don’t need. Let’s drop the country people are from since that won’t help us.

crew_df <- select(crew_df, -country)


Let’s create a separate smaller favorites table with only the id and favorite_goody columns.

favorites_df <- select(crew_df, id, favorite_goody)


Let’s create a separate allergy table with only the id and all of the columns that contain the word “allergy”.

allergy_df <- select(crew_df, id, contains("allergy"))


Finally, for easier reference, use select() to make the id and days_on_ship columns the first two columns in the table. Everything else can be in the same order.

favorites_df <- select(crew_df, id, days_on_ship, everything())

arrange() the crew

Let’s use arrange() to sort the crew members so the ones that will be on the ship the longest are on top. They’ll be the ones munching on are goodies the longest.

crew_df <- arrange(crew_df, days_on_ship)

Ope! What happened?

Descending order (biggest on top)

crew_df <- arrange(crew_df, desc(days_on_ship))

Alternative: use -column_name instead of desc(column_name)

crew_df <- arrange(crew_df, -days_on_ship)

mutate() new columns

Let’s get an idea of the total meals each crew mate will be eating baked goods. For that, we’ll need to take their days_on_the_ship times their meals_per_day.

# Use the * to multiply and the / to divide
crew_df <- mutate(crew_df, total_trip_meals = days_on_the_ship * meals_per_day)

Video 2.2: Forks in the road: if_else() and case_when()

if_else(days_on_ship > 7, true = “yes”, false = “no”)

if_else(days_on_ship > 7, true = “yes”, false = “no”, missing = “unknown”)

if_else(egg_allergy, 1,  0)

ifelse(egg_allergy, 1, 0)

Video 2.3: Boil it down // Show me the summary

Some like crawfish boils and others like data boils. Let’s boil down our data and find some summary statistics.

Back to Videos